Simple statistics using R

First steps

Author: Julien Martin

Affiliation: University of Ottawa

Published: April 4, 2024

0.1 learning outcomes

  • introduce you to some basic statistics in R ✔️

  • focus on linear models ✔️

  • fit simple linear models in R ✔️

  • check linear model assumptions in R ✔️

1 statistics using R

1.1 Scratching the surface

  • many, many statistical tests available in R

  • range from the simple to the highly complex

  • many are included in standard base installation of R

  • you can extend the range of statistics by installing additional packages

1.2 an example

  • does seeding clouds with dimethylsulphate alter the moisture content of clouds (can we make it rain!)

  • 10 random clouds were seeded and 10 random clouds unseeded

  • what’s the null hypothesis?

  • no difference in mean moisture content between seeded and unseeded clouds

1.3 Plotting the data

  • plot these data

  • interpretation?

  • what type of statistical test do you want to use?

clouds <- read.csv('data/clouds.csv')
str(clouds)
'data.frame':   20 obs. of  2 variables:
 $ moisture : num  301 302 299 316 307 ...
 $ treatment: chr  "seeded" "seeded" "seeded" "seeded" ...
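
One way to draw the plot is sketched below. Because data/clouds.csv is not reproduced in these slides, the sketch uses a simulated stand-in called clouds_sim; its values, group means and spread are assumptions made for illustration only, not the real measurements.

```r
# Simulated stand-in for clouds.csv (hypothetical values, not the real data)
set.seed(1)
clouds_sim <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7), rnorm(10, mean = 295, sd = 7)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)

# Boxplot of the response split by treatment group
boxplot(moisture ~ treatment, data = clouds_sim,
        xlab = "treatment", ylab = "moisture content")

# Group means to read alongside the plot
group_means <- tapply(clouds_sim$moisture, clouds_sim$treatment, mean)
group_means
```

A boxplot shows both the shift between groups and the spread within each group, which is what the choice of test depends on.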

1.4 t-test

t.test(clouds$moisture~clouds$treatment, var.equal=TRUE)

    Two Sample t-test

data:  clouds$moisture by clouds$treatment
t = 2.5404, df = 18, p-value = 0.02051
alternative hypothesis: true difference in means between group seeded and group unseeded is not equal to 0
95 percent confidence interval:
  1.482679 15.657321
sample estimates:
  mean in group seeded mean in group unseeded 
                303.63                 295.06 
  • reject or fail to reject the null hypothesis?
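
The decision rule can also be written out in code. This is a minimal sketch on simulated stand-in data (clouds_sim is hypothetical), and the 0.05 threshold stored in alpha is the conventional choice, assumed here rather than stated in the slides.

```r
# Simulated stand-in for clouds.csv (hypothetical values, not the real data)
set.seed(1)
clouds_sim <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7), rnorm(10, mean = 295, sd = 7)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)

# Equal-variance two-sample t-test, as on the slide
tt <- t.test(moisture ~ treatment, data = clouds_sim, var.equal = TRUE)

# t.test() returns a list; the pieces can be extracted by name
tt$statistic   # t value
tt$parameter   # degrees of freedom (n1 + n2 - 2)
tt$p.value     # p-value

# Compare the p-value with the chosen significance level
alpha <- 0.05
decision <- if (tt$p.value < alpha) "reject H0" else "fail to reject H0"
decision
```

For the real data above, the reported p-value of 0.02051 is below 0.05.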

1.5 linear models in R

  • an alternative, but equivalent approach is to use a linear model to compare the means in each group

  • general linear models are generally thought of as simple models, but can be used to model a wide variety of data and experimental designs

  • traditionally, statistics is performed (and taught) like following a recipe book (ANOVA, t-test, ANCOVA, etc.)

  • general linear models provide a coherent and theoretically satisfying framework on which to conduct your analyses

1.6 what are linear models?

  • t-test

  • ANOVA

  • factorial ANOVA

  • ANCOVA

  • linear regression

  • multiple regression

  • etc, etc

1.7 model formula

  • general linear modelling is based around the concept of model formulae


response variable ~ explanatory variable(s) + error


  • literally read as ‘variation in response variable modelled as a function of the explanatory variable(s) plus variation not explained by the explanatory variables’

  • it’s the attributes of the response and explanatory variables that determine the type of linear model fitted
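
In R a model formula is an object in its own right, so it can be created and inspected before any model is fitted. A small sketch, using the variable names from the clouds example (the object names f and rhs are just illustrative):

```r
# A formula can be stored without any data being present
f <- moisture ~ treatment

class(f)      # "formula"
all.vars(f)   # "moisture" "treatment"

# The explanatory terms on the right-hand side
rhs <- attr(terms(f), "term.labels")
rhs           # "treatment"
```

The error term is never written in the R formula; it is implied by the fitting function.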

1.8 linear models in R

  • the function for fitting linear models in R is lm()

  • the response variable comes first, then the tilde ~, then the name of the explanatory variable

clouds.lm <- lm(moisture ~ treatment, data=clouds)
  • how does R know that you want to perform a t-test (ANOVA)?
class(clouds$treatment)
[1] "character"
  • here the explanatory variable is a character vector, which lm() automatically treats as a factor (a categorical variable)
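
In the output above, treatment is stored as character text rather than as a factor; lm() converts character predictors to factors when it builds the model. The sketch below checks this on simulated stand-in data (clouds_sim and the column treatment_f are hypothetical names) by comparing a fit on the character column with a fit on an explicit factor.

```r
# Simulated stand-in for clouds.csv (hypothetical values, not the real data)
set.seed(1)
clouds_sim <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7), rnorm(10, mean = 295, sd = 7)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)

# Fit with the character column, as on the slide
fit_chr <- lm(moisture ~ treatment, data = clouds_sim)

# Fit again after converting the column to a factor explicitly
clouds_sim$treatment_f <- factor(clouds_sim$treatment)
fit_fct <- lm(moisture ~ treatment_f, data = clouds_sim)

# Levels that lm() derived from the character column
fit_chr$xlevels

# Both fits give the same intercept and group difference
unname(coef(fit_chr))
unname(coef(fit_fct))
```

Converting to a factor yourself is still useful when you want to control the order of the levels, and therefore which group is the reference.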

1.9 linear models in R

  • to display the ANOVA table use the anova() function
anova(clouds.lm)
Analysis of Variance Table

Response: moisture
          Df  Sum Sq Mean Sq F value  Pr(>F)  
treatment  1  367.22  367.22  6.4538 0.02051 *
Residuals 18 1024.20   56.90                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • do you notice anything familiar about the p value?

  • (hint: see the output from the t-test we did earlier)
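
The link between the two outputs can be checked numerically: with two groups and equal variances assumed, the F statistic is the square of the t statistic and the p-values coincide. In the outputs above, 2.5404 squared is about 6.4536, which matches F = 6.4538 up to rounding. A sketch on simulated stand-in data (clouds_sim is hypothetical):

```r
# Simulated stand-in for clouds.csv (hypothetical values, not the real data)
set.seed(1)
clouds_sim <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7), rnorm(10, mean = 295, sd = 7)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)

# The same comparison done two ways
tt      <- t.test(moisture ~ treatment, data = clouds_sim, var.equal = TRUE)
fit     <- lm(moisture ~ treatment, data = clouds_sim)
aov_tab <- anova(fit)

# Pull out the statistics and p-values
t_sq  <- unname(tt$statistic)^2
f_val <- aov_tab[["F value"]][1]
p_t   <- tt$p.value
p_f   <- aov_tab[["Pr(>F)"]][1]

all.equal(t_sq, f_val)   # squared t equals F
all.equal(p_t, p_f)      # identical p-values
```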

1.10 linear models in R

  • we have sufficient evidence to reject the null hypothesis (as before)

  • therefore, there is a significant difference in mean moisture content between seeded and unseeded clouds

  • do we accept this inference?

  • what about assumptions?

  • we could use Shapiro-Wilk and F tests as before

  • much better to assess visually by plotting the residuals
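
For reference, the formal tests mentioned above might look like the following sketch. It runs on simulated stand-in data (clouds_sim is hypothetical), and applying the Shapiro-Wilk test to the model residuals is one common choice, assumed here rather than taken from the slides.

```r
# Simulated stand-in for clouds.csv (hypothetical values, not the real data)
set.seed(1)
clouds_sim <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7), rnorm(10, mean = 295, sd = 7)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)

fit <- lm(moisture ~ treatment, data = clouds_sim)

# Normality: Shapiro-Wilk test on the model residuals
sw <- shapiro.test(residuals(fit))
sw$p.value

# Equal variances: F test comparing the two group variances
ft <- var.test(moisture ~ treatment, data = clouds_sim)
ft$p.value
```

With only 10 observations per group these tests have little power, which is one reason the visual check is usually preferred.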

1.11 Plotting residuals

  • clouds.lm is a linear model object, so we can do things with it

  • we can use the plot() function directly to display residual plots

  • normality assumption

  • equal variance assumption

  • unusual or influential observations

1.12 Plotting residuals

par(mfrow = c(2, 2), bg = "#FFFFFFCC")
plot(clouds.lm)
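
The two panels used most often can also be built by hand from the residuals and fitted values, which helps show what plot() is doing. A minimal sketch on simulated stand-in data (clouds_sim, res and fv are hypothetical names):

```r
# Simulated stand-in for clouds.csv (hypothetical values, not the real data)
set.seed(1)
clouds_sim <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7), rnorm(10, mean = 295, sd = 7)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)

fit <- lm(moisture ~ treatment, data = clouds_sim)
res <- residuals(fit)   # observed minus fitted values
fv  <- fitted(fit)      # here, simply the two group means

par(mfrow = c(1, 2))

# Equal variance: the vertical spread should be similar in both groups
plot(fv, res, xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)

# Normality: points should lie close to the reference line
qqnorm(res)
qqline(res)
```

A single built-in panel can be requested with the which argument, for example plot(fit, which = 2) for the Q-Q plot.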

1.13 other linear models

traditional name           model formula                R code
simple linear regression   Y ~ X1 (continuous)          lm(Y ~ X1)
one-way ANOVA              Y ~ X1 (categorical)         lm(Y ~ X1)
two-way ANOVA              Y ~ X1 (cat) + X2 (cat)      lm(Y ~ X1 + X2)
ANCOVA                     Y ~ X1 (cat) + X2 (cont)     lm(Y ~ X1 + X2)
multiple regression        Y ~ X1 (cont) + X2 (cont)    lm(Y ~ X1 + X2)
factorial ANOVA            Y ~ X1 (cat) * X2 (cat)      lm(Y ~ X1 * X2)
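
The difference between + and * in these formulae can be inspected without any data: + adds main effects only, while * adds the main effects and their interaction. A short sketch using the placeholder names Y, X1 and X2 from the table (additive and factorial are illustrative object names):

```r
# Terms generated by an additive formula
additive <- attr(terms(Y ~ X1 + X2), "term.labels")
additive    # "X1" "X2"

# Terms generated by a factorial (interaction) formula
factorial <- attr(terms(Y ~ X1 * X2), "term.labels")
factorial   # "X1" "X2" "X1:X2"
```

So lm(Y ~ X1 * X2) is shorthand for lm(Y ~ X1 + X2 + X1:X2).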

2 Thanks!

Credit: I borrowed slides from Alex Douglas.